Machine Learning: Mathematical Theory and Applications

Computer lab 4

Group 12 - Hugo Xinghe CHEN & Jianhan WANG

Published

October 30, 2024

Warning

The instructions in this lab assume that the following package is installed:

  • caret

Packages may be installed by executing install.packages('packagename'), where 'packagename' is the name of the package, e.g. 'splines'. You may have to install additional packages to solve the computer lab. If you want this file to run locally on your computer, you have to change some paths to where the data are stored (see below).

Introduction

This computer lab treats topics such as supervised learning, unsupervised learning, and semi-supervised learning.


Instructions

In this computer lab, you will work in groups of two. Each group hands in a single set of solutions. It is strictly forbidden to copy code or solutions from other groups, or to use AI tools (such as ChatGPT). Not obeying these instructions is regarded as academic misconduct and may have serious consequences for you.

This computer lab is worth a total of 15 marks (the subject has a maximum of 100 marks). The maximum marks for a given problem are shown in parentheses at the top of the section. The problems you should solve are marked with the symbol 💪 and surrounded by a box. Hand in the solutions and programming code in the form of an html document generated by Quarto. Before submitting, carefully check that the document compiles without any errors. Use properly formatted figures and programming code.

Warning

Not submitting according to the instructions may result in loss of marks. The same applies to poorly formatted reports (including poorly formatted code/figures, unnecessarily long outputs, etc.).

1. Linear discriminant analysis for cancer data (2 marks)

In this problem, we consider a generative classification model for breast cancer data, where features computed from 569 digitised images are modelled and subsequently used to classify the tumor type (malign or benign). The data is stored in the file cancer_data_10features.Rdata, which can be downloaded from the Canvas page of the course.

We will consider the two features radius (mean of distances from center to points on the perimeter) and texture (standard deviation of gray-scale values). Your task in this problem is to implement a linear discriminant analysis method without using packages (such as the lda() function from the MASS package).

The following code reads in the data, creates the training and test datasets, and visualises these data for the classification problem.

rm(list=ls()) # Remove variables 
cat("\014") # Clean workspace
set.seed(1234)
suppressMessages(library(caret))
load(file = '/Users/chenxinghe/Desktop/Hugo/ESILV/A5/S9_UTS/37401_ML/ComputerLab/4/cancer_data_10features.RData')
train_obs <- createDataPartition(y = cancer_data_10features$diagnosis, p = .80, list = FALSE)
train <- cancer_data_10features[train_obs, 1:3]
test <- cancer_data_10features[-train_obs, 1:3]
# Confirm both training and test are balanced with respect to diagnosis (malign tumor)
cat("Percentage of training data consisting of malign tumors:", 
              100*mean(train$diagnosis == "M"))
Percentage of training data consisting of malign tumors: 37.2807
cat("Percentage of test data consisting of malign tumors:", 
              100*mean(test$diagnosis == "M"))
Percentage of test data consisting of malign tumors: 37.16814
# Train and test for each tumor class
ind_train_M <- train$diagnosis == "M"
ind_test_M <- test$diagnosis == "M"
plot(train[ind_train_M, 2], train[ind_train_M, 3], xlab = "Radius", ylab = "Texture", col = "cornflowerblue", xlim = c(5, 30),  ylim = c(5, 40))
lines(train[!ind_train_M, 2], train[!ind_train_M, 3], col = "lightcoral", type = "p")
lines(test[ind_test_M, 2], test[ind_test_M, 3], col = "cornflowerblue", type = "p", pch = "x")
lines(test[!ind_test_M, 2], test[!ind_test_M, 3], col = "lightcoral", type = "p", pch = "x")
legend("topright", legend = c("Malign train", "Benign train", "Malign test", "Benign test"), col = c("cornflowerblue", "lightcoral", "cornflowerblue", "lightcoral"), pch = c("o", "o", "x", "x"))

Assume that the class conditionals \(p(\mathbf{x}|y)\), for \(y\in\{1, 2\}\) (1-malign, 2-benign), follow a bivariate normal distribution with different means, however, with the same covariance matrix, i.e. \[\mathbf{x}|y=m \sim \mathcal{N}(\boldsymbol{\mu}_m,\boldsymbol{\Sigma}),\,\,m=1,2.\]

Let \(\pi_m=\Pr(y=m)\) for \(m=1,2\). These \(\pi_m\) probabilities can be interpreted as the marginal (marginalising out \(\mathbf{x}\) through \(\Pr(y=m)=\int p(\mathbf{x}, y=m)\mathbf{d}\mathbf{x}\)) class membership probabilities, i.e. unconditional \(\mathbf{x}\). Given an \(\mathbf{x}=(x_1, x_2)^\top\), we want to compute its class membership probability. From Bayes’ theorem, it follows that the conditional distribution of the class membership is \[\Pr(y=m|\mathbf{x})=\frac{p(\mathbf{x}|y=m)\Pr(y=m)}{p(\mathbf{x})}\propto p(\mathbf{x}|y=m)\Pr(y=m),\,\,m=1,2.\]

The denominator above is the marginal density of \(\mathbf{x}\) (marginalising out \(y\) through \(p(\mathbf{x})=\sum_m p(\mathbf{x}, y=m)\)), which normalises \(\Pr(y=m|\mathbf{x})\) so that \(\sum_m\Pr(y=m|\mathbf{x})=1\).
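As a quick numerical illustration of the normalisation above (a sketch with hypothetical parameter values, not the estimates from the cancer data), the unnormalised terms \(p(\mathbf{x}|y=m)\pi_m\) can be divided by their sum \(p(\mathbf{x})\) so that the posterior class probabilities sum to one:

```r
# Sketch with hypothetical parameters: posterior class probabilities via Bayes
dmvnorm2 <- function(x, mu, Sigma) {
  # Bivariate normal density evaluated at the point x
  d <- x - mu
  as.numeric(exp(-0.5 * t(d) %*% solve(Sigma) %*% d) / (2 * pi * sqrt(det(Sigma))))
}
mu1 <- c(17, 22); mu2 <- c(12, 18)      # hypothetical class means
Sigma <- matrix(c(4, 1, 1, 9), 2, 2)    # hypothetical shared covariance
pis <- c(0.37, 0.63)                    # hypothetical class priors
x_new <- c(14, 20)
joint <- c(pis[1] * dmvnorm2(x_new, mu1, Sigma),
           pis[2] * dmvnorm2(x_new, mu2, Sigma))
p_x <- sum(joint)         # marginal density p(x): sum of the joint over classes
posterior <- joint / p_x  # Pr(y = m | x), sums to one by construction
```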

💪 Problem 1.1

Derive (analytically) \(p(\mathbf{x})\) for the example above.

suppressMessages(library(grid))
suppressMessages(library(jpeg))
img <- readJPEG("/Users/chenxinghe/Desktop/Hugo/ESILV/A5/S9_UTS/37401_ML/ComputerLab/4/Problem1_1.jpg")
grid.raster(img)

💪 Problem 1.2

Given the training data, estimate the parameters \(\pi_m,\boldsymbol{\mu}_m\) for \(m=1,2\), and \(\boldsymbol{\Sigma}\). Predict the labels of the test data and plot them in a scatter plot with different colors to represent the different classes.

Tip

Let \(\mathbf{x}_\star=(x_{\star1}, x_{\star2})^\top\) be a point we want to predict \(y\) for and let \(\widehat{y}(\mathbf{x}_\star)\) denote the prediction. The prediction is obtained as \[\widehat{y}(\mathbf{x}_\star)= \arg \max_m \Pr(y=m|\mathbf{x}_\star)=\arg \max_m \log \Pr(y=m|\mathbf{x}_\star).\]It can be shown that for the model above, \[\widehat{y}(\mathbf{x}_\star)=\arg\max_m \left\{ \log p(\mathbf{x}_\star|y=m) + \log\pi_m \right\}=\arg\max_m \left\{ \mathbf{x}_{\star}^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_{m} - \frac{1}{2}\boldsymbol{\mu}_{m}^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_{m}+\log \pi_{m}\right\}.\]

mu_M <- colMeans(train[ind_train_M, 2:3])
mu_B <- colMeans(train[!ind_train_M, 2:3])
cov_matrix <- cov(train[, 2:3])

pi_M <- mean(train$diagnosis == "M")
pi_B <- mean(train$diagnosis == "B")

discriminant_lda <- function(x_new, mu, cov_matrix, pi) {
  return(t(x_new) %*% solve(cov_matrix) %*% mu - 0.5 * t(mu) %*% solve(cov_matrix) %*% mu + log(pi))
}

predict_class_lda <- function(x_new) {
  score_M <- discriminant_lda(x_new, mu_M, cov_matrix, pi_M)
  score_B <- discriminant_lda(x_new, mu_B, cov_matrix, pi_B)
  
  if (score_M > score_B) {
    return("M")
  } else {
    return("B")
  }
}

predictions <- apply(test[, 2:3], 1, function(row) predict_class_lda(as.matrix(row)))

test$prediction <- as.factor(predictions)
plot(test[, 2], test[, 3], col = ifelse(test$prediction == "M", "red", "black"), 
     main = "Scatter Plot of Predicted Classes", 
     xlab = "Radius", ylab = "Texture", 
     xlim = range(test[, 2], na.rm = TRUE), 
     ylim = range(test[, 3], na.rm = TRUE))
legend("topright", legend = c("Predicted Malign", "Predicted Benign"), col = c("red", "black"), pch = 1)

💪 Problem 1.3

Plot the decision bound of the classifier and the predictions of the test data from Problem 1.2 in the same plot.

Tip

To determine the decision boundary, we can solve the equation \(\log\Pr(y=1|\mathbf{x}_\star)=\log\Pr(y=2|\mathbf{x}_\star)\) for \(\mathbf{x}_\star=(x_{\star1}, x_{\star2})^\top\). This gives \[\mathbf{x}_{\star}^{\top}\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_{1} - \boldsymbol{\mu}_{2})- \frac{1}{2}\boldsymbol{\mu}_{1}^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_{1} + \frac{1}{2}\boldsymbol{\mu}_{2}^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_{2}+\log \pi_{1} - \log \pi_{2}=0.\]

After some algebra, we can express the above as \(x_{\star2}=\gamma_0 + \gamma_1x_{\star1}\) for some \(\gamma_0,\gamma_1\) that can be evaluated using the estimated \(\pi_m,\boldsymbol{\mu}_m\), and \(\boldsymbol{\Sigma}\). We can now construct a grid for \(x_{\star1}\) (e.g. x1_grid<-seq(5,30,length.out = 100)) and use \(x_{\star2}=\gamma_0 + \gamma_1x_{\star1}\) to plot the decision bound.
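For reference, the omitted algebra is short. Write \(\mathbf{a} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_{1} - \boldsymbol{\mu}_{2}) = (a_1, a_2)^\top\) and collect the constant terms as \[c = -\frac{1}{2}\boldsymbol{\mu}_{1}^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_{1} + \frac{1}{2}\boldsymbol{\mu}_{2}^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_{2} + \log\pi_{1} - \log\pi_{2}.\]The boundary equation then reads \(a_1x_{\star1} + a_2x_{\star2} + c = 0\), so (provided \(a_2 \neq 0\)) solving for \(x_{\star2}\) gives \[\gamma_0 = -\frac{c}{a_2}, \qquad \gamma_1 = -\frac{a_1}{a_2}.\]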

test$prediction <- as.factor(predictions)
plot(test[, 2], test[, 3], col = ifelse(test$prediction == "M", "red", "black"), 
     main = "Scatter Plot of Predicted Classes", 
     xlab = "Radius", ylab = "Texture", 
     xlim = range(test[, 2], na.rm = TRUE), 
     ylim = range(test[, 3], na.rm = TRUE))
legend("topleft", legend = c("Predicted Malign", "Predicted Benign"), col = c("red", "black"), pch = 1)


x1_grid <- seq(5, 30, length.out = 100)
# The boundary coefficients are constant, so compute them once (not per grid point)
a <- solve(cov_matrix) %*% (mu_M - mu_B)
c0 <- -0.5 * t(mu_M) %*% solve(cov_matrix) %*% mu_M +
  0.5 * t(mu_B) %*% solve(cov_matrix) %*% mu_B + log(pi_M) - log(pi_B)
# Express the boundary as x2 = gamma_0 + gamma_1 * x1
gamma_0 <- -as.numeric(c0) / a[2]
gamma_1 <- -a[1] / a[2]
x2_boundary <- gamma_0 + gamma_1 * x1_grid
lines(x1_grid, x2_boundary, col = "blue", lwd = 2)
legend("topright", legend = c("Decision Boundary"), col = "blue", lwd = 2)

💪 Problem 1.4

Fit a logistic regression to the training data using the glm() function. Compare the results to the generative model in Problem 1.2. Comment on the results.

Tip

By now you should know many metrics to evaluate a binary classifier.

train$diagnosis_binary <- ifelse(train$diagnosis == "M", 1, 0)

logistic_model <- glm(diagnosis_binary ~ radius + texture, data = train, family = binomial)

test$logistic_prob <- predict(logistic_model, newdata = test, type = "response")

# Classify based on 0.5 threshold
test$logistic_prediction <- ifelse(test$logistic_prob > 0.5, "M", "B")
test$logistic_prediction <- as.factor(test$logistic_prediction)


plot(test[, 2], test[, 3], col = ifelse(test$logistic_prediction == "M", "red", "black"), 
     main = "Scatter Plot of Logistic Regression Predicted Classes with LDA Boundary", 
     xlab = "Radius", ylab = "Texture", 
     xlim = range(test[, 2], na.rm = TRUE), 
     ylim = range(test[, 3], na.rm = TRUE))
legend("topleft", legend = c("Predicted Malign", "Predicted Benign"), col = c("red", "black"), pch = 1)

# Add LDA decision boundary to logistic regression plot
lines(x1_grid, x2_boundary, col = "blue", lwd = 2)
legend("topright", legend = c("LDA Decision Boundary"), col = "blue", lwd = 2)

# Confusion matrix for logistic regression
logistic_conf_matrix <- table(Predicted = test$logistic_prediction, Actual = test$diagnosis)
logistic_accuracy <- sum(diag(logistic_conf_matrix)) / sum(logistic_conf_matrix)
logistic_precision <- logistic_conf_matrix["M", "M"] / sum(logistic_conf_matrix["M", ])
logistic_recall <- logistic_conf_matrix["M", "M"] / sum(logistic_conf_matrix[ , "M"])
logistic_specificity <- logistic_conf_matrix["B", "B"] / sum(logistic_conf_matrix["B", ])

# Confusion matrix for LDA
lda_conf_matrix <- table(Predicted = test$prediction, Actual = test$diagnosis)
lda_accuracy <- sum(diag(lda_conf_matrix)) / sum(lda_conf_matrix)
lda_precision <- lda_conf_matrix["M", "M"] / sum(lda_conf_matrix["M", ])
lda_recall <- lda_conf_matrix["M", "M"] / sum(lda_conf_matrix[ ,"M"])
lda_specificity <- lda_conf_matrix["B", "B"] / sum(lda_conf_matrix["B", ])


print_results <- function(model_name, accuracy, precision, recall, specificity) {
  cat(sprintf(
    "%s Results:\n- Accuracy: %f\n- Precision: %f\n- Recall: %f\n- Specificity: %f\n\n",
    model_name, accuracy, precision, recall, specificity
  ))
}

# LDA Results
print_results("LDA", lda_accuracy, lda_precision, lda_recall, lda_specificity)
LDA Results:
- Accuracy: 0.876106
- Precision: 1.000000
- Recall: 0.666667
- Specificity: 0.835294
# Logistic Regression Results
print_results("Logistic Regression", logistic_accuracy, logistic_precision, logistic_recall, logistic_specificity)
Logistic Regression Results:
- Accuracy: 0.876106
- Precision: 0.850000
- Recall: 0.809524
- Specificity: 0.890411
We can see in the plot that LDA and logistic regression disagree on a few test points. The two classifiers achieve the same accuracy, but LDA obtains perfect precision at the cost of a much lower recall and a somewhat lower specificity.

2. Quadratic discriminant analysis for cancer data (3 marks)

A more flexible model allows a different covariance matrix for each class conditional, i.e. \[\mathbf{x}|y=m \sim \mathcal{N}(\boldsymbol{\mu}_m,\boldsymbol{\Sigma}_m).\]It can be shown that \[\widehat{y}(\mathbf{x}_{\star})=\arg\underset{m}{\max}\left(-\frac{1}{2}\log\left|\boldsymbol{\Sigma}_{m}\right|-\frac{1}{2}(\mathbf{x}_{\star}-\boldsymbol{\mu}_{m})^{\top}\boldsymbol{\Sigma}_{m}^{-1}(\mathbf{x}_{\star}-\boldsymbol{\mu}_{m})+\log\pi_{m}\right), \]

where \(\left|\cdot\right|\) denotes the determinant. We note that the expression for \(\log\Pr(y=m|\mathbf{x}_\star)\) is quadratic in \(\mathbf{x}_\star\), and hence the decision boundary of this classifier is not linear.
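To see the non-linearity, a minimal sketch (with hypothetical parameters, not the cancer-data estimates) evaluates the difference of the two discriminants on a grid and draws its zero contour, which is exactly the decision boundary:

```r
# QDA discriminant for one class (hypothetical parameters below)
qda_disc <- function(x, mu, Sigma, pi_m) {
  d <- x - mu
  as.numeric(-0.5 * log(det(Sigma)) - 0.5 * t(d) %*% solve(Sigma) %*% d + log(pi_m))
}
mu1 <- c(17, 22); Sigma1 <- matrix(c(6, 2, 2, 12), 2, 2)     # hypothetical class 1
mu2 <- c(12, 18); Sigma2 <- matrix(c(2, 0.5, 0.5, 6), 2, 2)  # hypothetical class 2
pi1 <- 0.37; pi2 <- 0.63
g1 <- seq(5, 30, length.out = 80)
g2 <- seq(5, 40, length.out = 80)
# Difference of the two discriminants on the grid; its sign gives the prediction
diff_grid <- outer(seq_along(g1), seq_along(g2),
                   Vectorize(function(i, j) {
                     x <- c(g1[i], g2[j])
                     qda_disc(x, mu1, Sigma1, pi1) - qda_disc(x, mu2, Sigma2, pi2)
                   }))
# The zero level set of the difference is the (curved) decision boundary
contour(g1, g2, diff_grid, levels = 0, drawlabels = FALSE,
        xlab = "Radius", ylab = "Texture",
        main = "Sketch: quadratic decision boundary")
```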

💪 Problem 2.1

Derive (analytically) \(p(\mathbf{x})\) with the new class conditional distribution above.

suppressMessages(library(grid))
suppressMessages(library(jpeg))
img <- readJPEG("/Users/chenxinghe/Desktop/Hugo/ESILV/A5/S9_UTS/37401_ML/ComputerLab/4/Problem2_1.jpg")
grid.raster(img)

💪 Problem 2.2

Given the training data, estimate the parameters \(\pi_m,\boldsymbol{\mu}_m\), and \(\boldsymbol{\Sigma}_m\) for \(m=1,2\). Predict the labels of the test data and plot them in a scatter plot with different colors to represent the different classes.

Tip

The log of the determinant of a matrix \(A\) can be computed as log(det(A)).
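A small numerical aside (not required for this two-dimensional problem): for higher-dimensional covariance matrices det() can under- or overflow, and base R's determinant() returns the log-determinant directly:

```r
# determinant() returns log|A| (and the sign of det(A)) without forming det(A)
A <- matrix(c(4, 1, 1, 9), 2, 2)   # a small example matrix
ld <- determinant(A, logarithm = TRUE)
log_det_A <- as.numeric(ld$modulus)  # equals log(det(A)) when det(A) > 0
```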

discriminant_qda <- function(x_new, mu, cov_matrix, pi) {
  return(
    -0.5 * log(det(cov_matrix)) - 0.5 * t(x_new - mu) %*% solve(cov_matrix) %*% (x_new - mu) + log(pi)
  )
}

cov_M <- cov(train[ind_train_M, 2:3])
cov_B <- cov(train[!ind_train_M, 2:3])

predict_class_qda <- function(x_new) {
  score_M <- discriminant_qda(x_new, mu_M, cov_M, pi_M)
  score_B <- discriminant_qda(x_new, mu_B, cov_B, pi_B)
  
  if (score_M > score_B) {
    return("M")
  } else {
    return("B")
  }
}

# Predict labels for the test data
predictions <- apply(test[, 2:3], 1, function(row) predict_class_qda(as.matrix(row)))

test$prediction <- as.factor(predictions)
plot(test[, 2], test[, 3], col = ifelse(test$prediction == "M", "red", "black"), 
     main = "Scatter Plot of Predicted Classes", 
     xlab = "Radius", ylab = "Texture", 
     xlim = range(test[, 2], na.rm = TRUE), 
     ylim = range(test[, 3], na.rm = TRUE))
legend("topright", legend = c("Predicted Malign", "Predicted Benign"), col = c("red", "black"), pch = 1)

💪 Problem 2.3

Compare the quadratic discriminant classifier to the linear discriminant and logistic classifiers from Problem 1. Discuss the results.

# QDA predictions from 2.2
test$qda_prediction <- as.factor(predictions)

qda_conf_matrix <- table(Predicted = test$qda_prediction, Actual = test$diagnosis)
qda_accuracy <- sum(diag(qda_conf_matrix)) / sum(qda_conf_matrix)
qda_precision <- qda_conf_matrix["M", "M"] / sum(qda_conf_matrix["M", ])
qda_recall <- qda_conf_matrix["M", "M"] / sum(qda_conf_matrix[, "M"])
qda_specificity <- qda_conf_matrix["B", "B"] / sum(qda_conf_matrix["B", ])


# LDA Results
print_results("LDA", lda_accuracy, lda_precision, lda_recall, lda_specificity)
LDA Results:
- Accuracy: 0.876106
- Precision: 1.000000
- Recall: 0.666667
- Specificity: 0.835294
# Logistic Regression Results
print_results("Logistic Regression", logistic_accuracy, logistic_precision, logistic_recall, logistic_specificity)
Logistic Regression Results:
- Accuracy: 0.876106
- Precision: 0.850000
- Recall: 0.809524
- Specificity: 0.890411
# QDA Results
print_results("QDA", qda_accuracy, qda_precision, qda_recall, qda_specificity)
QDA Results:
- Accuracy: 0.876106
- Precision: 0.850000
- Recall: 0.809524
- Specificity: 0.890411
Here, the QDA model matches the logistic regression model's performance exactly on the test set. In this case, both models outperform the LDA model thanks to their higher recall and specificity.

To separate the performance of QDA and logistic regression, we run a 10-fold cross-validation in the following code.
set.seed(123)
folds <- createFolds(train$diagnosis, k = 10)

calc_metrics <- function(conf_matrix) {
  accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
  precision <- conf_matrix["M", "M"] / sum(conf_matrix["M", ])
  recall <- conf_matrix["M", "M"] / sum(conf_matrix[, "M"])
  specificity <- conf_matrix["B", "B"] / sum(conf_matrix["B", ])
  return(c(accuracy, precision, recall, specificity))
}

# Initialize metrics storage
metrics_list <- list("LDA" = matrix(0, nrow = 10, ncol = 4),
                     "Logistic" = matrix(0, nrow = 10, ncol = 4),
                     "QDA" = matrix(0, nrow = 10, ncol = 4))
colnames(metrics_list$LDA) <- c("Accuracy", "Precision", "Recall", "Specificity")
colnames(metrics_list$Logistic) <- colnames(metrics_list$QDA) <- colnames(metrics_list$LDA)

# Cross-validation
for (i in 1:10) {
  fold_indices <- folds[[i]]
  train_fold <- train[-fold_indices, ]
  val_fold <- train[fold_indices, ]
  
  # LDA: re-estimate the parameters on the fold's training part.
  # Note: predict_class_lda() (and predict_class_qda() below) read mu_M, mu_B,
  # cov_matrix, cov_M, cov_B, pi_M, pi_B from the global environment, so
  # reassigning them here redirects the predictions to the fold's estimates.
  mu_M <- colMeans(train_fold[train_fold$diagnosis == "M", 2:3])
  mu_B <- colMeans(train_fold[train_fold$diagnosis == "B", 2:3])
  cov_matrix <- cov(train_fold[, 2:3])
  pi_M <- mean(train_fold$diagnosis == "M")
  pi_B <- mean(train_fold$diagnosis == "B")

  lda_preds <- apply(val_fold[, 2:3], 1, function(row) predict_class_lda(as.matrix(row)))
  lda_conf_matrix <- table(Predicted = lda_preds, Actual = val_fold$diagnosis)
  metrics_list$LDA[i, ] <- calc_metrics(lda_conf_matrix)
  
  # Logistic Regression
  logistic_model <- glm(diagnosis_binary ~ radius + texture, data = train_fold, family = binomial)
  val_fold$logistic_prob <- predict(logistic_model, newdata = val_fold, type = "response")
  logistic_preds <- ifelse(val_fold$logistic_prob > 0.5, "M", "B")
  logistic_conf_matrix <- table(Predicted = logistic_preds, Actual = val_fold$diagnosis)
  metrics_list$Logistic[i, ] <- calc_metrics(logistic_conf_matrix)
  
  # QDA
  cov_M <- cov(train_fold[train_fold$diagnosis == "M", 2:3])
  cov_B <- cov(train_fold[train_fold$diagnosis == "B", 2:3])
  qda_preds <- apply(val_fold[, 2:3], 1, function(row) predict_class_qda(as.matrix(row)))
  qda_conf_matrix <- table(Predicted = qda_preds, Actual = val_fold$diagnosis)
  metrics_list$QDA[i, ] <- calc_metrics(qda_conf_matrix)
}

# Average metrics for each classifier
lda_metrics <- colMeans(metrics_list$LDA)
logistic_metrics <- colMeans(metrics_list$Logistic)
qda_metrics <- colMeans(metrics_list$QDA)

cat("10-Fold Cross-Validation Results:\n\n")
10-Fold Cross-Validation Results:
print_results("LDA", lda_metrics["Accuracy"], lda_metrics["Precision"], lda_metrics["Recall"], lda_metrics["Specificity"])
LDA Results:
- Accuracy: 0.862174
- Precision: 0.964556
- Recall: 0.658824
- Specificity: 0.834067
print_results("Logistic Regression", logistic_metrics["Accuracy"], logistic_metrics["Precision"], logistic_metrics["Recall"], logistic_metrics["Specificity"])
Logistic Regression Results:
- Accuracy: 0.897246
- Precision: 0.896291
- Recall: 0.817647
- Specificity: 0.902064
print_results("QDA", qda_metrics["Accuracy"], qda_metrics["Precision"], qda_metrics["Recall"], qda_metrics["Specificity"])
QDA Results:
- Accuracy: 0.888454
- Precision: 0.919018
- Recall: 0.764706
- Specificity: 0.879406
From the cross-validation results, we can clearly see that logistic regression performs best overall.

💪 Problem 2.4

A doctor contacts you and says she has a patient whose digitised image has the features radius=13.20 and texture=19.22. Use your best classifier from Problem 2.3 to provide the doctor with some advice.

patient_data <- data.frame(radius = 13.20, texture = 19.22)

# Note: logistic_model was last fitted inside the cross-validation loop
# (i.e. on 9 of the 10 folds); for a final answer one could refit it on
# the full training set first
patient_prob <- predict(logistic_model, newdata = patient_data, type = "response")

patient_prediction <- ifelse(patient_prob > 0.5, "Malign", "Benign")

cat("Prediction for the patient with radius =", patient_data$radius, 
    "and texture =", patient_data$texture, ":\n")
Prediction for the patient with radius = 13.2 and texture = 19.22 :
cat("Predicted class:", patient_prediction, "\n")
Predicted class: Benign 
cat("Probability of malignancy:", round(patient_prob, 3), "\n")
Probability of malignancy: 0.156 
We would advise the doctor that this patient probably has a benign tumor, but the chance of having a malignant tumor is 15.6%.

3. Unsupervised Gaussian mixture models (2 marks)

Both Problems 1 and 2 above are so-called supervised learning methods because the class labels are observed (we know the status of the tumor). In many problems, we do not observe the classes and we might not even know if the classes are interpretable as in the example above (the classes are interpreted as the severity of the tumor). Let us consider two examples to further illustrate what we mean with interpretability.

The first example concerns a dataset that contains measurements of the lengths (in mm) of 1000 insects of a certain species. The data is stored in the file insects.Rdata, which can be downloaded from the Canvas page of the course. The following code reads in and visualises the data.

rm(list=ls()) # Remove variables
cat("\014") # Clean workspace
load(file = '/Users/chenxinghe/Desktop/Hugo/ESILV/A5/S9_UTS/37401_ML/ComputerLab/4/insects.RData')
hist(insects$length, xlab = "Length (in mm)", ylab = "Counts", col = "cornflowerblue", main = "Histogram of the lengths of insects")

We see that there seem to be two groups of insects, the first group with lengths of \(\approx1\text{-}6\)mm and the second group with lengths of \(\approx7\text{-}11\)mm. Note that there are no other variables available in the dataset, but a plausible explanation for this clustering is a second, unobserved variable: sex. This is what we mean by cluster interpretability.

The next example concerns the relationship between the closing price and the trading volume (in log-scale) for the Associated Banc-Corp (ASB) stock that is traded on the New York Stock Exchange (NYSE). The data are on a daily frequency and cover the period 2014-12-26 to 2017-11-10. The data are stored in the file asb.Rdata, which can be downloaded from the Canvas page of the course. The following code reads in and visualises the data.

load(file = '/Users/chenxinghe/Desktop/Hugo/ESILV/A5/S9_UTS/37401_ML/ComputerLab/4/asb.RData')
plot(Close ~ log(Volume), data = asb, col = "cornflowerblue", main = "ASB stock: Closing price vs log of trading volume")

The data seem to be divided into two clusters. The following code plots a time series plot of the closing price.

date <- as.Date(asb$Date)
Close <- asb$Close
plot(date, Close, col = "cornflowerblue",type = "l", main = "ASB stock: Closing prices during 2014-12-26 to 2017-11-10")

💪 Problem 3.1

Give a possible interpretation of the two clusters.

For the ASB stock data, the two clusters could represent different trading regimes or market conditions.

One cluster may correspond to lower trading volumes at lower closing prices, plausibly reflecting stable periods or limited interest in the stock.

The other cluster, at higher volumes and prices, suggests increased investor interest, perhaps during more volatile or news-driven periods.

Such an interpretation aligns with a latent variable capturing market sentiment or volatility.

Further insights of the unsupervised Gaussian mixture model can be gained by understanding the data generating process via simulation. Assume we have only one feature (like the insects example) and two classes. Let

\[x|y=m, \mu_m, \sigma^2_m \sim \mathcal{N}(\mu_m, \sigma^2_m),\,\,m=1,2,\]

with \(\mu_1 = -2,\sigma_1=0.5, \mu_2=4, \sigma_2=1.5\). Assume further that \(\pi_1=\Pr(y=1) = 0.2\) and \(\pi_2= \Pr(y=2) = 0.8\).

💪 Problem 3.2

Derive (analytically) \(p(x)\) for the example above.

suppressMessages(library(grid))
suppressMessages(library(jpeg))
img <- readJPEG("/Users/chenxinghe/Desktop/Hugo/ESILV/A5/S9_UTS/37401_ML/ComputerLab/4/Problem3_2.jpg")
grid.raster(img)

The following code simulates \(n=1000\) observations from the specified mixture model above. It uses a for-loop for pedagogical reasons. Some explanation of the code follows.

mu1 <- -2
sigma1 <- 0.5
mu2 <- 4
sigma2 <- 1.5
pi1 <- 0.2
pi2 <- 1 - pi1
# Store in vectors:
mu <- c(mu1, mu2)
sigma <- c(sigma1, sigma2)
pis <- c(pi1, pi2)
n <- 1000
y <- rep(NA, n)
x <- rep(NA, n)

for(i in 1:n){
  # Simulate indicator with probability pi2 (1 - component 2, 0 - component 1)
  y[i] <- rbinom(n = 1, size=1, prob = pis[2]) + 1
  x[i] <- rnorm(n = 1, mean = mu[y[i]], sd = sigma[y[i]])
}

# Sanity check: Confirm the marginal class probabilities are pi1 and pi2 (approx)
cat("Estimated class 1 marginal probability: ", mean(y == 1), ". True: ", pis[1], sep = "")
Estimated class 1 marginal probability: 0.209. True: 0.2
cat("Estimated class 2 marginal probability: ", mean(y == 2), ". True: ", pis[2], sep = "")
Estimated class 2 marginal probability: 0.791. True: 0.8
# Normalised histogram
hist(x, xlab = "x", col = "cornflowerblue", main = "Normalised histogram of the simulated x", prob = TRUE)

Most of the code above is self-explanatory except the part within the for-loop. The function rbinom() simulates n=1 random number from a binomial distribution with size=1 trials and success probability prob=pis[2]. This is a Bernoulli variable that takes the values 0 and 1 with probabilities \(\pi_1\) and \(\pi_2\), respectively. We add 1 to make the outcome \(m=1,2\) (instead of \(m=0,1\)). The next line of code simulates from a normal distribution whose mean and variance depend on the class membership of the observation (stored in y[i]).
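The same data-generating process can also be simulated without a loop. A vectorised sketch (equivalent in distribution to the loop above; the _vec names are chosen to avoid clashing with the x and y used later) is:

```r
set.seed(1)
n_sim <- 1000
# Draw all class labels at once, then draw x conditional on the labels
y_vec <- sample(1:2, n_sim, replace = TRUE, prob = c(0.2, 0.8))
x_vec <- rnorm(n_sim, mean = c(-2, 4)[y_vec], sd = c(0.5, 1.5)[y_vec])
```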

Plotting a (normalised) histogram of the simulated data (x) is an empirical representation of the marginal density \(p(x)\) you derived in Problem 3.2. By comparing them, we can ensure that the simulation above is correct (and that the expression you derived for \(p(x)\) is correct!).

💪 Problem 3.3

Plot the \(p(x)\) you derived in Problem 3.2 in the same figure as a (normalised) histogram.

Tip

Create a fine grid of \(x\) values (e.g. x_grid<-seq(-6,10,length.out=1000)) and evaluate \(p(x)\) on the grid. Make sure to set prob=TRUE in the hist() function (so the plots are in the same scale). Note that with a few bins it is hard to evaluate the quality of the approximation. Try setting the argument breaks=30 in the hist() function.

x_grid <- seq(-6, 10, length.out = 1000)

# Mixture density p(x) = pi1 * N(x; -2, 0.5^2) + pi2 * N(x; 4, 1.5^2)
p_x <- 0.2 * dnorm(x_grid, mean = -2, sd = 0.5) +
  0.8 * dnorm(x_grid, mean = 4, sd = 1.5)

hist(x, xlab = "x", col = "cornflowerblue", main = "Normalised histogram of the simulated x with p(x)",
     prob = TRUE, breaks = 30)

# Overlay the density function p(x)
lines(x_grid, p_x, col = "red", lwd = 2)

4. Unsupervised learning via the EM algorithm (5 marks)

The expectation-maximisation (EM) algorithm is a powerful algorithm to learn the parameters in unsupervised learning models. In addition, the EM algorithm gives us the conditional (on the training data) distribution of the class membership for each observation, which we can use for classification. We will first learn how to implement the EM algorithm for the unsupervised Gaussian mixture model for a single feature \(x\) and two classes, which we studied in Problem 3. We stress that the difference compared to Problems 1 and 2 (which also had Gaussian components) is that the labels are not observed in this problem.

The following function implements the EM algorithm. The implementation assumes a single feature and two classes, i.e. \(x\in\mathbb{R}\) and \(M=2\).

EM_GMM_M2 <- function(x, mu_start, sigma_start, pis_start, n_iter = 100) {
  # Estimates the parameters in an unsupervised Gaussian mixture model with M = 2 classes. 
  # Runs the EM algorithm for n_iter iterations. x is assumed univariate. 
  # mu_start, sigma_start, pis_start are starting values.
  stopifnot(sum(pis_start) == 1)
  # Quantities to save
  pis_all <-  matrix(NA, nrow = n_iter, ncol = 2)
  mu_all <- matrix(NA, nrow = n_iter, ncol = 2)
  sigma_all <- matrix(NA, nrow = n_iter, ncol = 2)
  Q_all <- rep(NA, n_iter)
  log_like_all <- rep(NA, n_iter)
  
  # Initialise
  mu <- mu_start
  sigma <- sigma_start
  pis <- pis_start
  n <- length(x)
  W <- matrix(0, nrow = n, ncol = 2) # Unnormalised weights for each observation
  log_pdf_class <- matrix(0, nrow = n, ncol = 2) # The log-likelihood of the two classes for each obs. To compute Q. 
  for(j in 1:n_iter){
    # Start EM steps
    # E-step: Compute the expected log-likelihood Q
    for(m in 1:2){
      # The log-density for each class
      log_pdf_class[, m] <- dnorm(x, mean = mu[m], sd = sigma[m], log = TRUE) + log(pis[m])
      # Unnormalised weights
      W[, m] <- pis[m]*dnorm(x, mean = mu[m], sd = sigma[m])
    }
    w <- W/rowSums(W) # Normalise weights
    n_hat <- colSums(w) # Expected number of obs per class
    Q <- sum(rowSums(w*log_pdf_class)) # Expected log-likelihood
    
    # M-step: Maximise Q. Closed form analytical solution in Gaussian mixture models
    for(m in 1:2){
      pis[m] <- n_hat[m]/n
      mu[m] <- 1/n_hat[m]*sum(w[, m]*x)
      sigma[m] <- sqrt(1/n_hat[m]*sum(w[, m]*(x - mu[m])^2))
    }
    # End EM steps. Save estimates, Q, and log-likelihood
    pis_all[j, ] <- pis
    mu_all[j, ] <- mu
    sigma_all[j, ] <- sigma
    Q_all[j] <- Q
    # Compute log-likelihood at current parameter values
    for(m in 1:2){
      # Unnormalised weights
      W[, m] <- pis[m]*dnorm(x, mean = mu[m], sd = sigma[m])
    }
    log_like_all[j] <-  sum(log(rowSums(W)))
  } # End EM iterations
  # Return everything as a list
  return(list(pi_hat = pis, mu_hat = mu, sigma_hat = sigma, 
              weights = W/rowSums(W), pis_all = pis_all, 
              mu_all = mu_all, sigma_all = sigma_all, Q_all = Q_all, 
              log_like_all = log_like_all))
}

The log-likelihood is guaranteed to not decrease at any iteration, which makes the variable log_like_all above useful for debugging. The following code uses the function above to estimate the parameters using our simulated data from Problem 3, performs some convergence checks, and plots the class posterior probability distribution for an observation.
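Since the log-likelihood can only increase (or stay the same), this is easy to check programmatically. The following self-contained sketch re-implements the same E- and M-step updates in compact form on freshly simulated data (the _s-suffixed names are hypothetical, chosen to avoid clashing with the lab's variables) and asserts monotonicity:

```r
set.seed(42)
x_sim <- c(rnorm(200, -2, 0.5), rnorm(800, 4, 1.5))        # mixture as in Problem 3
mu_s <- c(0, 1); sigma_s <- c(1, 2); pis_s <- c(0.5, 0.5)  # deliberately poor start
ll <- rep(NA, 30)
for (j in 1:30) {
  # E-step: responsibilities (normalised weights)
  W_s <- cbind(pis_s[1] * dnorm(x_sim, mu_s[1], sigma_s[1]),
               pis_s[2] * dnorm(x_sim, mu_s[2], sigma_s[2]))
  w_s <- W_s / rowSums(W_s)
  n_hat_s <- colSums(w_s)
  # M-step: closed-form updates (same as in EM_GMM_M2)
  pis_s <- n_hat_s / length(x_sim)
  mu_s <- colSums(w_s * x_sim) / n_hat_s
  sigma_s <- sqrt(colSums(w_s * outer(x_sim, mu_s, "-")^2) / n_hat_s)
  # Log-likelihood at the updated parameter values
  ll[j] <- sum(log(pis_s[1] * dnorm(x_sim, mu_s[1], sigma_s[1]) +
                   pis_s[2] * dnorm(x_sim, mu_s[2], sigma_s[2])))
}
stopifnot(all(diff(ll) >= -1e-8))  # EM guarantee: never decreases
```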

# Initial values
pis_start <- c(0.5, 0.5)
mu_start <- c(1, 4)
sigma_start <- c(1, 3)
n_iter <- 20
EM_result <- EM_GMM_M2(x, mu_start, sigma_start, pis_start, n_iter = n_iter)

# Visualise convergence for each parameters (adding starting value)
matplot(0:n_iter, rbind(pis_start, EM_result$pis_all), main = 'pis', pch = c("o", "o"), col = c("cornflowerblue", "lightcoral"), xlab = "Iteration", ylab = "pis", ylim = c(0, 1.5))
legend("topright", legend = c("Class 1", "Class 2"), col = c("cornflowerblue", "lightcoral"), pch = c("o", "o"))

matplot(0:n_iter, rbind(mu_start, EM_result$mu_all), main = 'mu', pch = c("o", "o"), col = c("cornflowerblue", "lightcoral"), xlab = "Iteration", ylab = "mu", ylim = c(-3, 6))
legend("topright", legend = c("Class 1", "Class 2"), col = c("cornflowerblue", "lightcoral"), pch = c("o", "o"))

matplot(0:n_iter, rbind(sigma_start, EM_result$sigma_all), main = 'sigma', pch = c("o", "o"), col = c("cornflowerblue", "lightcoral"), xlab = "Iteration", ylab = "sigma", ylim = c(0, 4))
legend("topright", legend = c("Class 1", "Class 2"), col = c("cornflowerblue", "lightcoral"), pch = c("o", "o"))

par(mfrow = c(1, 1))
# Inspect convergence
plot(EM_result$log_like_all, main = 'Log-likelihood', type = "l", col = "cornflowerblue", xlab = "Iteration", ylab = "Log-likelihood")

plot(EM_result$Q_all, main = 'Expected log-likelihood (Q)', type = "l", col = "cornflowerblue", xlab = "Iteration", ylab = "Log-likelihood")

print("True parameters (pi, mu, and sigma)")
[1] "True parameters (pi, mu, and sigma)"
print(pis)
[1] 0.2 0.8
print(mu)
[1] -2  4
print(sigma)
[1] 0.5 1.5
print("Estimated parameters (pi, mu, and sigma)")
[1] "Estimated parameters (pi, mu, and sigma)"
print(EM_result$pi_hat)
[1] 0.2086633 0.7913367
print(EM_result$mu_hat)
[1] -2.096384  3.983013
print(EM_result$sigma_hat)
[1] 0.479649 1.465332
# Inspect the classification probabilities of observation 10
ind <- 10
barplot(names.arg = c("Class 1", "Class 2"), EM_result$weights[ind, ], col = "cornflowerblue", ylim = c(0, 1), main = paste("Class (posterior) probability observation ", ind, sep = ""))

💪 Problem 4.1

Explain the label-switching problem when estimating unsupervised Gaussian mixture models. Do you observe label-switching above?

In unsupervised Gaussian mixture models, the likelihood is invariant to permutations of the component labels: the model estimates cluster parameters, but nothing ties a particular cluster to the name "Class 1" or "Class 2", so the labels can switch between runs (or, with poor starting values, between iterations).

This does not affect the clustering itself, only the interpretation of which parameter set is attached to which label.

In this case, we see no sign of label-switching. The EM algorithm converges smoothly, each parameter trajectory stays with one cluster throughout, and the estimates match the true values in their original order (e.g. pi ≈ (0.2, 0.8) and mu ≈ (-2, 4)).
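The invariance behind label-switching can be verified directly: writing the mixture with the component labels in either order gives exactly the same density, so the likelihood cannot distinguish the two. A minimal check using the true parameters from Problem 3:

```r
# The same two-component mixture written with the labels in both orders.
dens_12 <- function(x) 0.2 * dnorm(x, -2, 0.5) + 0.8 * dnorm(x, 4, 1.5)
dens_21 <- function(x) 0.8 * dnorm(x, 4, 1.5) + 0.2 * dnorm(x, -2, 0.5)

x_grid <- seq(-5, 8, length.out = 200)
max(abs(dens_12(x_grid) - dens_21(x_grid)))  # 0: the data cannot tell them apart
```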

The next problem uses the EM algorithm to estimate an unsupervised Gaussian mixture model for the insects data. In particular, the model will be used to classify three insects: observations 6, 244, and 421 in insects.Rdata. The following code plots a (normalised) histogram of the data and highlights the feature values of these three observations.

hist(insects$length, col = "cornflowerblue", main = "Histogram of insects' lengths", prob = TRUE, xlab = "Length", ylim = c(0, 0.4), xlim = c(0, 14))
abline(v = insects$length[6], lwd = 1.5, col = "lightcoral")
abline(v = insects$length[244], lwd = 1.5, col = "purple")
abline(v = insects$length[421], lwd = 1.5, col = "lightpink")
legend("topright", legend = c("Obs 6", "Obs 244", "Obs 421"), col = c("lightcoral", "purple", "lightpink"), lwd = c(1.5, 1.5, 1.5))

💪 Problem 4.2

Use the EM_GMM_M2() function to estimate an unsupervised Gaussian mixture model for the insect data in Problem 3. Analyse the convergence. Compare the parameter estimates to those obtained by the function normalmixEM() in the mixtools package.

Tip

The EM algorithm is sensitive to starting values. Try a few different starting values and pick the solution that gives you the highest log-likelihood value. You might have to increase the number of iterations if you have not achieved convergence yet. Finally, recall that the log-likelihood is guaranteed to not decrease at each iteration.
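The multi-start strategy in the tip can be sketched generically. Here best_of_starts and toy_fit are hypothetical names of our own; in the lab, fit_fun would be EM_GMM_M2() and the log-likelihood would be the last entry of log_like_all:

```r
# Generic multi-start pattern: fit from several starting values and keep
# the run that reaches the highest final log-likelihood.
best_of_starts <- function(fit_fun, starts) {
  fits <- lapply(starts, fit_fun)
  lls <- vapply(fits, function(f) f$loglik, numeric(1))
  fits[[which.max(lls)]]
}

# Toy fitter whose final "log-likelihood" depends on the starting value:
toy_fit <- function(s) list(start = s, loglik = -(s - 3)^2)
best <- best_of_starts(toy_fit, starts = list(0, 3, 10))
best$start  # 3: the start that reached the highest log-likelihood
```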

suppressMessages(library(mixtools))

mu_start <- c(4, 8) # 2 peaks in insect histogram
sigma_start <- c(1, 1) 
pis_start <- c(0.5, 0.5)
n_iter <- 25

EM_result <- EM_GMM_M2(insects$length, mu_start, sigma_start, pis_start, n_iter = n_iter)

plot(EM_result$log_like_all, main = 'Log-likelihood in EM_GMM_M2', type = "l", col = "cornflowerblue", xlab = "Iteration", ylab = "Log-likelihood")

mixtools_results <- normalmixEM(insects$length, k = 2)
number of iterations= 30 
plot(mixtools_results$all.loglik, main = "Log-Likelihood in normalmixEM", type = "l", col = "cornflowerblue", xlab = "Iteration", ylab = "Log-likelihood")

cat("EM_GMM_M2 parameters (pi, mu, and sigma)\n")
EM_GMM_M2 parameters (pi, mu, and sigma)
cat(EM_result$pi_hat, "\n")
0.7935567 0.2064433 
cat(EM_result$mu_hat, "\n")
3.994985 8.103937 
cat(EM_result$sigma_hat, "\n\n")
0.9905863 0.6674929 
cat("normalmixEM parameters (pi, mu, and sigma)\n")
normalmixEM parameters (pi, mu, and sigma)
cat(mixtools_results$lambda, "\n")
0.7935566 0.2064434 
cat(mixtools_results$mu, "\n")
3.994984 8.103936 
cat(mixtools_results$sigma, "\n\n")
0.9905859 0.6674937 
cat("Log-Likelihood (EM_GMM_M2):", max(EM_result$log_like_all), "\n")
Log-Likelihood (EM_GMM_M2): -1820.534 
cat("Log-Likelihood (normalmixEM):", mixtools_results$loglik, "\n")
Log-Likelihood (normalmixEM): -1820.534 
We can see that the EM_GMM_M2() function appears to converge in fewer iterations than normalmixEM() (25 versus 30).
Both converge to essentially the same parameter estimates and the same log-likelihood.

💪 Problem 4.3

Plot the class posterior probabilities for insects 6, 244, and 421 with the model estimated in Problem 4.2. Are the results as expected? Explain.

Tip

The function EM_GMM_M2() returns an \(n\times2\) matrix weights, which contains the class posterior probabilities evaluated at the parameter estimates from the EM algorithm.

posterior_insects <- EM_result$weights[c(6, 244, 421), ]

barplot(t(posterior_insects), beside = TRUE, 
        names.arg = c("Insect 6", "Insect 244", "Insect 421"),
        col = c("red", "blue"), 
        legend = c("Class 1", "Class 2"),
        ylim = c(0, 1),
        main = "Posterior probabilities for insects 6, 244, and 421",
        xlab = "Insect", ylab = "Posterior Probability")

Since the dataset is unlabelled, there is no ground truth to compare against; instead, we judge the estimates against the histogram shown before Problem 4.1.

The results are as expected. Insects 6 and 421 lie near different modes of the histogram, so they should not belong to the same class, which matches our estimates. Insect 244 is closer to the component centred near length 8 (the one containing insect 421), and the model indeed assigns it to the same class as insect 421 with the higher posterior probability.
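The posterior bars can also be reproduced by hand from Bayes' theorem, plugging in the parameter estimates printed earlier (values copied from the EM_GMM_M2 output above; treat them as approximate):

```r
# Class posterior: p(y = m | x) = pi_m N(x; mu_m, sigma_m) / sum_k pi_k N(x; mu_k, sigma_k)
pis <- c(0.7935567, 0.2064433)
mu <- c(3.994985, 8.103937)
sigma <- c(0.9905863, 0.6674929)

posterior <- function(x) {
  w <- pis * dnorm(x, mu, sigma)
  w / sum(w)
}

round(posterior(4), 3)  # near the first mode: almost all mass on class 1
round(posterior(8), 3)  # near the second mode: almost all mass on class 2
```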

💪 Problem 4.4

Write a function that implements the EM algorithm for the unsupervised Gaussian mixture model with any number of components \(M\), i.e. not limited to \(M=2\) as the EM_GMM_M2() function. You can still assume that \(x\) is univariate. Use your function to estimate two models for the dataset fish.Rdata with, respectively, \(M=3\) and \(M=4\) classes. The dataset can be downloaded from the Canvas page of the course. The dataset contains the lengths of 523 fishes.

Tip

Choose suitable starting values for the means by considering a histogram with 30 breaks of the data. Make sure you run enough iterations by monitoring the convergence via the log-likelihood and the expected log-likelihood.
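One way to automate the tip's advice is to read off candidate starting means from the local maxima of a kernel density estimate. The peak-finding line below is our own sketch, shown on simulated stand-in data rather than the actual fish lengths:

```r
set.seed(1)
# Simulated stand-in for the fish lengths: three well-separated groups.
x <- c(rnorm(200, 20, 2), rnorm(200, 35, 3), rnorm(120, 50, 4))

d <- density(x)
# Interior local maxima of the density estimate: the slope's sign flips from + to -.
peaks <- d$x[which(diff(sign(diff(d$y))) == -2) + 1]
round(peaks)  # roughly 20, 35, 50 -> reasonable starting means
```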

load(file = '/Users/chenxinghe/Desktop/Hugo/ESILV/A5/S9_UTS/37401_ML/ComputerLab/4/fish.RData')

hist(fish$length, col = "cornflowerblue", main = "Histogram of fish lengths", breaks = 30, prob = TRUE, xlab = "Length", ylim = c(0, 0.08), xlim = c(15, 70))

EM_GMM_M <- function(x, mu_start, sigma_start, pis_start, M, n_iter = 100) {
  # Estimates the parameters in an unsupervised Gaussian mixture model with M classes. 
  # Runs the EM algorithm for n_iter iterations. x is assumed univariate. 
  stopifnot(isTRUE(all.equal(sum(pis_start), 1))) # tolerate floating-point rounding
  
  # Quantities to save
  pis_all <-  matrix(NA, nrow = n_iter, ncol = M)
  mu_all <- matrix(NA, nrow = n_iter, ncol = M)
  sigma_all <- matrix(NA, nrow = n_iter, ncol = M)
  Q_all <- rep(NA, n_iter)
  log_like_all <- rep(NA, n_iter)
  
  # Initialise
  mu <- mu_start
  sigma <- sigma_start
  pis <- pis_start
  n <- length(x)
  W <- matrix(0, nrow = n, ncol = M) # Unnormalised weights for each observation
  log_pdf_class <- matrix(0, nrow = n, ncol = M) # The log-likelihood of the classes for each obs. To compute Q.
  
  for(j in 1:n_iter){
    # E-step: Compute the expected log-likelihood Q
    for(m in 1:M){
      # The log-density for each class
      log_pdf_class[, m] <- dnorm(x, mean = mu[m], sd = sigma[m], log = TRUE) + log(pis[m])
      # Unnormalised weights
      W[, m] <- pis[m]*dnorm(x, mean = mu[m], sd = sigma[m])
    }
    w <- W/rowSums(W) # Normalise weights
    n_hat <- colSums(w) # Expected number of obs per class
    Q <- sum(rowSums(w*log_pdf_class)) # Expected log-likelihood
    
    # M-step: Maximise Q. Closed form analytical solution in Gaussian mixture models
    for(m in 1:M){
      pis[m] <- n_hat[m]/n
      mu[m] <- 1/n_hat[m]*sum(w[, m]*x)
      sigma[m] <- sqrt(1/n_hat[m]*sum(w[, m]*(x - mu[m])^2))
    }
    
    # Save estimates, Q, and log-likelihood
    pis_all[j, ] <- pis
    mu_all[j, ] <- mu
    sigma_all[j, ] <- sigma
    Q_all[j] <- Q
    
    # Compute log-likelihood at current parameter values
    for(m in 1:M){
      W[, m] <- pis[m]*dnorm(x, mean = mu[m], sd = sigma[m])
    }
    log_like_all[j] <- sum(log(rowSums(W)))
  }
  
  # Return everything as a list
  return(list(pi_hat = pis, mu_hat = mu, sigma_hat = sigma, 
              weights = W/rowSums(W), pis_all = pis_all, 
              mu_all = mu_all, sigma_all = sigma_all, Q_all = Q_all, 
              log_like_all = log_like_all))
}


M <- 3
mu_start <- c(20, 35, 60)
sigma_start <- rep(1, M)
pis_start <- rep(1/M, M) 

EM_result_3 <- EM_GMM_M(fish$length, mu_start, sigma_start, pis_start, M = 3, n_iter = 100)
plot(EM_result_3$log_like_all, main = 'Log-likelihood with 3 classes', type = "l", col = "cornflowerblue", xlab = "Iteration", ylab = "Log-likelihood")

M <- 4
mu_start <- c(20, 35, 50, 70)
sigma_start <- rep(1, M)
pis_start <- rep(1/M, M) 

EM_result_4 <- EM_GMM_M(fish$length, mu_start, sigma_start, pis_start, M = 4, n_iter = 100)
plot(EM_result_4$log_like_all, main = 'Log-likelihood with 4 classes', type = "l", col = "cornflowerblue", xlab = "Iteration", ylab = "Log-likelihood")

💪 Problem 4.5

Plot \(p(x)\) (using the estimates from your EM algorithm) for the two models in Problem 4.4 in the same figure as a (normalised) histogram obtained with hist() using the argument breaks=30. Which of the two models seems visually better for modelling the fish lengths?

x_grid <- seq(min(fish$length), max(fish$length), length.out = 1000)

# For M = 3
p_x_M3 <- EM_result_3$pi_hat[1] * dnorm(x_grid, mean = EM_result_3$mu_hat[1], sd = EM_result_3$sigma_hat[1]) + 
  EM_result_3$pi_hat[2] * dnorm(x_grid, mean = EM_result_3$mu_hat[2], sd = EM_result_3$sigma_hat[2]) + 
  EM_result_3$pi_hat[3] * dnorm(x_grid, mean = EM_result_3$mu_hat[3], sd = EM_result_3$sigma_hat[3])

# For M = 4
p_x_M4 <- EM_result_4$pi_hat[1] * dnorm(x_grid, mean = EM_result_4$mu_hat[1], sd = EM_result_4$sigma_hat[1]) + 
  EM_result_4$pi_hat[2] * dnorm(x_grid, mean = EM_result_4$mu_hat[2], sd = EM_result_4$sigma_hat[2]) + 
  EM_result_4$pi_hat[3] * dnorm(x_grid, mean = EM_result_4$mu_hat[3], sd = EM_result_4$sigma_hat[3]) + 
  EM_result_4$pi_hat[4] * dnorm(x_grid, mean = EM_result_4$mu_hat[4], sd = EM_result_4$sigma_hat[4])

hist(fish$length, breaks = 30, col = "white", prob = TRUE, 
     main = "Histogram of fish lengths with GMM density estimates", xlab = "Length", ylim = c(0, 0.08))

# Overlay the density estimate for M = 3
lines(x_grid, p_x_M3, col = "red", lwd = 2, lty = 1)

# Overlay the density estimate for M = 4
lines(x_grid, p_x_M4, col = "blue", lwd = 2, lty = 2)

legend("topright", legend = c("M = 3", "M = 4"), col = c("red", "blue"), lty = c(1, 2), lwd = 2)

Visually, the M = 4 model seems to fit the fish lengths slightly better. The two fits are quite similar, but the four-class model tracks the histogram more closely in the 40 to 60 range.
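The two component-by-component density sums above can be written once, for any number of components. Here gmm_density is our own generic helper (a sketch, not part of the lab code), checked on arbitrary parameters by verifying the density integrates to approximately 1:

```r
# Mixture density p(x) = sum_m pi_m N(x; mu_m, sigma_m), evaluated on a grid of x values.
gmm_density <- function(x, pis, mu, sigma) {
  rowSums(sapply(seq_along(pis), function(m) pis[m] * dnorm(x, mu[m], sigma[m])))
}

# Sanity check: a Riemann sum of the density over a wide grid should be close to 1.
g <- seq(-10, 20, length.out = 2000)
sum(gmm_density(g, c(0.3, 0.7), c(0, 8), c(1, 2))) * (g[2] - g[1])  # ~ 1
```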

When the features are multivariate, such as in the example with the ASB stock above, the code for the EM algorithm requires some modification. However, the understanding and application of unsupervised Gaussian mixtures is the same as in the univariate case. The next problem uses a package to estimate an unsupervised (multivariate) Gaussian mixture model for the ASB stock application.

💪 Problem 4.6

Use the function mvnormalmixEM() from the mixtools package to estimate an unsupervised (multivariate) Gaussian mixture model with \(M=2\) classes to the ASB stock example with feature vector \(\mathbf{x}=(x_1,x_2)^\top\), with \(x_1=\text{Close}\) and \(x_2=\text{log(Volume)}\). Use all observations as training data. Plot a scatter plot with the predicted classes for the training data (use different colors to represent the different classes).

x1 <- asb$Close
x2 <- log(asb$Volume)
data_matrix <- cbind(x1, x2)

gmm_model <- mvnormalmixEM(data_matrix, k = 2)
number of iterations= 24 
predicted_classes <- apply(gmm_model$posterior, 1, which.max)

plot(x1, x2, col = predicted_classes,  pch = 1,
     xlab = "Close", ylab = "log(Volume)", 
     main = "Gaussian Mixture Model Classification")
legend("topright", legend = c("Class 1", "Class 2"), 
       col = 1:2, pch = 1)

5. Semi-supervised learning (3 marks)

In Problems 1 and 2 we assumed that all labels were observed, whereas none were observed in Problems 3 and 4. Another scenario is that labels are partially observed, i.e. we know the labels for only a subset of the observations. Statistical learning and modelling under this assumption is referred to as semi-supervised learning.

One of the most common applications of semi-supervised learning is in missing-data problems. In this problem, we study how accurately semi-supervised learning recovers parameters compared to a fully supervised algorithm. Moreover, we compare the accuracy of predictions that account for observations with missing labels to those that simply discard them.

We consider the problem of classifying penguins into species. The data are stored in the file palmer_penguins_missing.Rdata, which can be downloaded from the Canvas page of the course3. The dataset contains a set of features for 265 penguins of two different species, Adelie and Gentoo. The species variable is missing for 49 of the penguins. The following code reads in the data and plots the classes and missing labels in a two-dimensional feature space consisting of the penguin’s flipper length (length of the wing) and body mass.

load(file = '/Users/chenxinghe/Desktop/Hugo/ESILV/A5/S9_UTS/37401_ML/ComputerLab/4/palmer_penguins_missing.RData')
# Get indices to the classes
ind_gentoo <- (palmer_penguins_missing$species == "Gentoo")
ind_adelie <- (palmer_penguins_missing$species == "Adelie")
ind_missing <- is.na(palmer_penguins_missing$species)

x1 <- palmer_penguins_missing$flipper_length_mm
x2 <- palmer_penguins_missing$body_mass_g
plot(x1[ind_gentoo], x2[ind_gentoo], type = "p", col = "cornflowerblue", xlim = c(170, 250), ylim = c(2500, 7000), xlab = "Flipper length", ylab = "Body mass") 
lines(x1[ind_adelie], x2[ind_adelie], type = "p", col = "lightcoral") 
lines(x1[ind_missing], x2[ind_missing], type = "p", pch = "?", cex = 0.8, col = "black")
legend("topright", legend = c("Gentoo", "Adelie", "Missing"), col = c("cornflowerblue", "lightcoral", "black"), pch = c("o", "o", "?"))

For simplicity we use a single feature, the flipper length, in the problems below.

💪 Problem 5.1

Estimate a (fully) supervised Gaussian mixture model for the penguin data with species as labels and a single feature flipper length. Use separate means and variances for the classes.

Tip

A supervised Gaussian mixture assumes all labels are known. We can remove the observations with missing labels by creating a new data frame with no missing values as df_no_missing<-na.omit(palmer_penguins_missing).

df_no_missing <- na.omit(palmer_penguins_missing)

adelie_data <- df_no_missing[df_no_missing$species == "Adelie", ]
gentoo_data <- df_no_missing[df_no_missing$species == "Gentoo", ]

flipper_adelie <- adelie_data$flipper_length_mm
flipper_gentoo <- gentoo_data$flipper_length_mm

mean_adelie <- mean(flipper_adelie)
var_adelie <- var(flipper_adelie)

mean_gentoo <- mean(flipper_gentoo)
var_gentoo <- var(flipper_gentoo)

cat("Adelie - Mean:", mean_adelie, "Variance:", var_adelie, "\n")
Adelie - Mean: 190.7679 Variance: 21.99968 
cat("Gentoo - Mean:", mean_gentoo, "Variance:", var_gentoo, "\n")
Gentoo - Mean: 215.7404 Variance: 25.35913 
# Plot the data and the Gaussian distributions
hist(flipper_adelie, breaks = 10, freq = FALSE, col = "lightcoral", 
     xlim = c(170, 250), xlab = "Flipper Length (mm)", main = "Flipper Length Distribution")
curve(dnorm(x, mean = mean_adelie, sd = sqrt(var_adelie)), 
      add = TRUE, col = "red", lwd = 2)

hist(flipper_gentoo, breaks = 10, freq = FALSE, col = "cornflowerblue", add = TRUE)
curve(dnorm(x, mean = mean_gentoo, sd = sqrt(var_gentoo)), 
      add = TRUE, col = "blue", lwd = 2)

legend("topright", legend = c("Adelie", "Gentoo"), 
       col = c("red", "blue"), lwd = 2)

💪 Problem 5.2

Estimate a semi-supervised Gaussian mixture model for the penguin data with species as labels and a single feature, flipper length. Compare the parameter estimates to those in Problem 5.1. Comment on the result.

Tip

As discussed in the lecture, semi-supervised learning can be cast as a special version of the EM algorithm. Adapt the EM_GMM_M2 function to take an extra argument that contains a vector of labels (with NA for unknown labels) and construct the weights accordingly. Note that the log-likelihood of the non-missing observations is monitored (for convergence) in semi-supervised learning. This requires modifying log_like_all.
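The essential change the tip describes is small: after the usual E-step, the rows of the weight matrix whose labels are known are overwritten with hard 0/1 weights. In miniature, with toy values (not the penguin data):

```r
# Toy label vector: two known labels, two missing.
labels <- c("Adelie", NA, "Gentoo", NA)
w <- matrix(0.5, nrow = length(labels), ncol = 2)  # stand-in E-step weights

# Clamp rows with known labels to their class; leave missing-label rows untouched.
for (i in seq_along(labels)) {
  if (!is.na(labels[i])) {
    w[i, ] <- if (labels[i] == "Adelie") c(1, 0) else c(0, 1)
  }
}
w  # rows 1 and 3 are clamped to their known class; rows 2 and 4 keep soft weights
```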

EM_GMM_M2_semi <- function(x, species, mu_start, sigma_start, pis_start, n_iter = 100) {
  stopifnot(sum(pis_start) == 1)
  
  pis_all <- matrix(NA, nrow = n_iter, ncol = 2)
  mu_all <- matrix(NA, nrow = n_iter, ncol = 2)
  sigma_all <- matrix(NA, nrow = n_iter, ncol = 2)
  Q_all <- rep(NA, n_iter)
  log_like_all <- rep(NA, n_iter)
  
  # Initialise
  mu <- mu_start
  sigma <- sigma_start
  pis <- pis_start
  n <- length(x)
  W <- matrix(0, nrow = n, ncol = 2) # Unnormalized weights for each observation
  log_pdf_class <- matrix(0, nrow = n, ncol = 2) # Log-likelihood of the two classes for each obs.
  
  for (j in 1:n_iter) {
    # Start EM steps
    
    # E-step: Compute the expected log-likelihood Q
    for (m in 1:2) {
      # The log-density for each class
      log_pdf_class[, m] <- dnorm(x, mean = mu[m], sd = sigma[m], log = TRUE) + log(pis[m])
      # Unnormalized weights
      W[, m] <- pis[m] * dnorm(x, mean = mu[m], sd = sigma[m])
    }
    
    # Handle labelled data: set the weight to 1 for the known class and 0 for the other
    w <- W / rowSums(W) # Normalise weights
    for (i in 1:n) {
      if (!is.na(species[i])) {
        if (species[i] == "Adelie") {
          w[i, ] <- c(1, 0)  # Known Adelie: all weight on class 1
        } else if (species[i] == "Gentoo") {
          w[i, ] <- c(0, 1)  # Known Gentoo: all weight on class 2
        }
      }
    }
    
    n_hat <- colSums(w) # Expected number of obs per class
    Q <- sum(rowSums(w * log_pdf_class)) # Expected log-likelihood
    
    # M-step: Maximize Q. Closed form analytical solution in Gaussian mixture models
    for (m in 1:2) {
      pis[m] <- n_hat[m] / n
      mu[m] <- 1 / n_hat[m] * sum(w[, m] * x)
      sigma[m] <- sqrt(1 / n_hat[m] * sum(w[, m] * (x - mu[m])^2))
    }
    
    # Save estimates, Q, and log-likelihood
    pis_all[j, ] <- pis
    mu_all[j, ] <- mu
    sigma_all[j, ] <- sigma
    Q_all[j] <- Q
    
    # Compute the observed-data log-likelihood at current parameter values.
    # Labelled observations contribute only the density of their known class.
    for (m in 1:2) {
      W[, m] <- pis[m] * dnorm(x, mean = mu[m], sd = sigma[m])
    }
    lik <- rowSums(W)
    lik[!is.na(species) & species == "Adelie"] <- W[!is.na(species) & species == "Adelie", 1]
    lik[!is.na(species) & species == "Gentoo"] <- W[!is.na(species) & species == "Gentoo", 2]
    log_like_all[j] <- sum(log(lik))
  } # End EM iterations
  
  # Return everything as a list
  return(list(pi_hat = pis, mu_hat = mu, sigma_hat = sigma,
              weights = W / rowSums(W), pis_all = pis_all,
              mu_all = mu_all, sigma_all = sigma_all, Q_all = Q_all,
              log_like_all = log_like_all))
}

mu_start <- c(190, 215) 
sigma_start <- c(5, 5) 
pis_start <- c(0.5, 0.5)
n_iter <- 100

EM_result <- EM_GMM_M2_semi(palmer_penguins_missing$flipper_length_mm, palmer_penguins_missing$species, mu_start, sigma_start, pis_start, n_iter = 100)

print("Estimated parameters (pi, mu, and sigma)")
[1] "Estimated parameters (pi, mu, and sigma)"
print(EM_result$pi_hat)
[1] 0.5321285 0.4678715
print(EM_result$mu_hat)
[1] 189.6257 216.6867
print(EM_result$sigma_hat)
[1] 6.028273 7.020769
The semi-supervised model's means are very close to those of the fully supervised model.
However, its standard deviations imply larger variances than the supervised estimates (6.03^2 ≈ 36 > 22 and 7.02^2 ≈ 49 > 25), reflecting the extra uncertainty contributed by the observations with missing labels.

💪 Problem 5.3

The dataset palmer_penguins.Rdata, which can be downloaded from the Canvas page of the course, contains the true values for the labels that are missing in palmer_penguins_missing.Rdata. Compute the accuracy of the predictions for these observations using your semi-supervised Gaussian mixture model in Problem 5.2. Compare that to the classification obtained via your supervised Gaussian mixture model in Problem 5.1. Comment on the result.

Tip

When using the supervised Gaussian mixture model in Problem 5.1 for classification, recall that for a test observation \(x_\star\), the prediction is \[\widehat{y}(x_\star)=\arg\max_m \Pr(y=m|x_\star)=\arg\max_m p(x_\star|y=m)\Pr(y=m).\]

The parameters in the Gaussian class conditionals \(p(x_\star|y=m)\) and the marginal class probabilities \(\Pr(y=m)\) are estimated from the labelled training data (i.e. without missing label observations).

load('/Users/chenxinghe/Desktop/Hugo/ESILV/A5/S9_UTS/37401_ML/ComputerLab/4/palmer_penguins_missing.RData')
load('/Users/chenxinghe/Desktop/Hugo/ESILV/A5/S9_UTS/37401_ML/ComputerLab/4/palmer_penguins.RData')

x_missing <- palmer_penguins_missing$flipper_length_mm
species_missing <- palmer_penguins_missing$species
true_labels <- palmer_penguins$species

missing_indices <- which(is.na(species_missing))
true_labels_missing <- true_labels[missing_indices]

mu_start <- c(190, 215) 
sigma_start <- c(6, 7) 
pis_start <- c(0.5, 0.5)
n_iter <- 100

EM_result <- EM_GMM_M2_semi(
  palmer_penguins_missing$flipper_length_mm, 
  palmer_penguins_missing$species, 
  mu_start, sigma_start, pis_start, 
  n_iter = n_iter
)

mu_semi <- EM_result$mu_hat
sigma_semi <- EM_result$sigma_hat
pis_semi <- EM_result$pi_hat

# Define the prediction function
predict_labels <- function(x, mu, sigma, pis) {
  log_prob_1 <- log(pis[1]) + dnorm(x, mean = mu[1], sd = sigma[1], log = TRUE)
  log_prob_2 <- log(pis[2]) + dnorm(x, mean = mu[2], sd = sigma[2], log = TRUE)
  
  ifelse(log_prob_1 > log_prob_2, "Adelie", "Gentoo")
}


# Predict missing labels using the supervised model from Problem 5.1
mu_supervised <- c(mean_adelie, mean_gentoo)
sigma_supervised <- c(sqrt(var_adelie), sqrt(var_gentoo))
# Marginal class probabilities estimated from the labelled training data
pis_supervised <- c(nrow(adelie_data), nrow(gentoo_data)) / nrow(df_no_missing)

supervised_preds <- predict_labels(x_missing[missing_indices], 
                                   mu_supervised, sigma_supervised, pis_supervised)

# Predict missing labels using the semi-supervised model
semi_supervised_preds <- predict_labels(x_missing[missing_indices], 
                                        mu_semi, sigma_semi, pis_semi)

accuracy_supervised <- mean(supervised_preds == true_labels_missing)
accuracy_semi_supervised <- mean(semi_supervised_preds == true_labels_missing)

cat("Supervised Accuracy:", accuracy_supervised, "\n")
Supervised Accuracy: 0.9795918 
cat("Semi-Supervised Accuracy:", accuracy_semi_supervised, "\n")
Semi-Supervised Accuracy: 0.9795918 
# Confusion matrix for supervised GMM
conf_matrix_supervised <- confusionMatrix(
  factor(supervised_preds, levels = c("Adelie", "Gentoo")),
  factor(true_labels_missing, levels = c("Adelie", "Gentoo"))
)

# Confusion matrix for semi-supervised GMM
conf_matrix_semi <- confusionMatrix(
  factor(semi_supervised_preds, levels = c("Adelie", "Gentoo")),
  factor(true_labels_missing, levels = c("Adelie", "Gentoo"))
)

cat("Confusion Matrix - Supervised GMM\n")
Confusion Matrix - Supervised GMM
print(conf_matrix_supervised)
Confusion Matrix and Statistics

          Reference
Prediction Adelie Gentoo
    Adelie     33      0
    Gentoo      1     15
                                          
               Accuracy : 0.9796          
                 95% CI : (0.8915, 0.9995)
    No Information Rate : 0.6939          
    P-Value [Acc > NIR] : 3.778e-07       
                                          
                  Kappa : 0.9528          
                                          
 Mcnemar's Test P-Value : 1               
                                          
            Sensitivity : 0.9706          
            Specificity : 1.0000          
         Pos Pred Value : 1.0000          
         Neg Pred Value : 0.9375          
             Prevalence : 0.6939          
         Detection Rate : 0.6735          
   Detection Prevalence : 0.6735          
      Balanced Accuracy : 0.9853          
                                          
       'Positive' Class : Adelie          
                                          
cat("\nConfusion Matrix - Semi-Supervised GMM\n")

Confusion Matrix - Semi-Supervised GMM
print(conf_matrix_semi)
Confusion Matrix and Statistics

          Reference
Prediction Adelie Gentoo
    Adelie     33      0
    Gentoo      1     15
                                          
               Accuracy : 0.9796          
                 95% CI : (0.8915, 0.9995)
    No Information Rate : 0.6939          
    P-Value [Acc > NIR] : 3.778e-07       
                                          
                  Kappa : 0.9528          
                                          
 Mcnemar's Test P-Value : 1               
                                          
            Sensitivity : 0.9706          
            Specificity : 1.0000          
         Pos Pred Value : 1.0000          
         Neg Pred Value : 0.9375          
             Prevalence : 0.6939          
         Detection Rate : 0.6735          
   Detection Prevalence : 0.6735          
      Balanced Accuracy : 0.9853          
                                          
       'Positive' Class : Adelie          
                                          
To conclude, with the single feature flipper length, the fully supervised and semi-supervised models achieve identical performance on the held-out observations, as shown by their identical confusion matrices.
End of the course!

If you made it all the way here without losing too much hair (now you know how I became bald), you should be proud of yourself!

I hope you enjoyed the course and good luck on the final project. The project will be a lighter version of the computer labs, in particular since you may reuse computer lab solutions.

Footnotes

  1. The original data come from this source.↩︎

  2. The original data come from this source.↩︎

  3. The original data come from this source.↩︎